

Section: New Results

Large scale GPU-centric optimization

Participants: J. Gmys, M. Gobert and N. Melab

This contribution is a joint work with M. Mezmaz and D. Tuyttens from University of Mons (UMONS), and T. C. Pessoa and F. H. De Carvalho Junior from Universidade Federal do Ceará (UFC), Brazil. N. Melab and M. Mezmaz have been the guest editors [7] of a special issue of the CCPE journal on this topic.

Nowadays, accelerator-centric architectures offer orders-of-magnitude improvements in performance and energy efficiency. Interest in these parallel resources has recently been heightened by the advent of deep learning, which has made them key building blocks of modern supercomputers. In 2017, the focus was put on investigating these devices in the context of parallel optimization. Two major contributions are reported below: (1) massively parallel GPU-centric tree-based optimization for solving large permutation optimization problems to optimality; (2) CUDA Dynamic Parallelism (CDP) for backtracking. Another contribution [2], on the parallel solving of permutation (flow-shop) problems, is not presented here.

  • Massively parallel GPU-centric tree-based optimization. Within the context of the Ph.D. thesis of Jan Gmys [1] (jointly supervised with UMONS, Belgium), parallel Branch-and-Bound (B&B) has been revisited for multi-core processors, (multi-)GPU accelerators and MIC coprocessors [6]. In 2017, the focus was put on extending these contributions to large hybrid clusters. A bi-level parallel approach is proposed to revisit parallel B&B at the intra- and inter-node levels. The intra-node level combines the previous contributions for an efficient exploitation of the parallelism available inside a processing node, which can be a multi-core processor, a GPU, a Xeon Phi, or a combination of these (hybrid processing node). The inter-node (cluster) level deals with the parallelism provided by multiple processing nodes. In the Ph.D. thesis of M. Mezmaz, such parallelism was investigated through a grid-enabled approach called B&B@Grid, which includes interval-based work stealing (WS) and checkpointing mechanisms. In that approach, each processing node consists of a single core exploring a single interval at a time. Our contribution revisits those mechanisms for multi- and many-core resources, where a GPU, for instance, explores thousands of intervals simultaneously. It includes a bi-level WS mechanism together with a multi-interval checkpointing mechanism. The proposed approach has been evaluated on the OUESSANT GPU cluster located at IDRIS, Paris. The results show that, on the road to the exascale era, our approach scales up to 130,000 CUDA cores, reducing the execution time from 25 days (using B&B@Grid) to 9 hours.
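The interval-based work stealing above rests on encoding the permutation search space as integer intervals: each permutation of n elements corresponds to an index in [0, n!) via the factorial number system, so a worker's share of the tree is just an integer range that can be split when another worker steals. The following Python sketch illustrates the idea only; the function names and the even splitting policy are illustrative assumptions, not the actual B&B@Grid implementation:

```python
from math import factorial

def nth_permutation(elements, k):
    """Decode index k (0 <= k < n!) into the k-th permutation of
    `elements`, using the factorial number system (factoradic)."""
    pool = list(elements)
    perm = []
    for i in range(len(pool) - 1, -1, -1):
        # The i-th factoradic digit selects which remaining element comes next.
        digit, k = divmod(k, factorial(i))
        perm.append(pool.pop(digit))
    return perm

def split_interval(begin, end, parts):
    """Split [begin, end) into `parts` contiguous sub-intervals,
    e.g. to redistribute a victim's remaining work to thieves."""
    step, rem = divmod(end - begin, parts)
    out, lo = [], begin
    for i in range(parts):
        hi = lo + step + (1 if i < rem else 0)
        out.append((lo, hi))
        lo = hi
    return out
```

With this encoding, checkpointing a worker amounts to saving the integer bounds of its unexplored intervals, which is what makes a multi-interval checkpoint for a GPU (thousands of intervals) compact.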

  • CUDA Dynamic Parallelism (CDP) for backtracking. New GPGPU technologies, such as CUDA Dynamic Parallelism (CDP), can help deal with recursive patterns of computation, such as the divide-and-conquer scheme used by backtracking algorithms. In 2017, in collaboration with Universidade Federal do Ceará (Brazil), we investigated the CDP mechanism using highly irregular algorithms. We have proposed a GPU-accelerated backtracking algorithm using CDP that extends a well-known parallel backtracking model. The algorithm analyzes the memory requirements of subsequent kernel generations and performs no dynamic memory allocation on the GPU, unlike related works from the literature. It has been extensively evaluated using the N-Queens puzzle and instances of the Asymmetric Traveling Salesman Problem (ATSP) as test cases. The proposed CDP algorithm may, under some conditions, outperform its non-CDP counterpart by a factor of up to 25. Compared to other CDP-based strategies from the literature, it is on average 8× faster.
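The point about avoiding dynamic allocation rests on being able to bound, before launching the next generation of kernels, how much memory their nodes can possibly need. For N-Queens, a partial placement at depth d has at most n - d children, so the child buffer can be sized up front. The Python sketch below illustrates only this bounding idea on the host; the real algorithm runs on the device, and all names here are illustrative assumptions:

```python
def expand_frontier(n, frontier):
    """Expand one depth level of N-Queens backtracking.

    Every partial placement of depth d has at most n - d valid
    children, so a buffer of len(frontier) * (n - d) slots can be
    preallocated before expansion -- mirroring how memory for the
    next kernel generation can be sized without device-side malloc.
    """
    if not frontier:
        return [], 0
    d = len(frontier[0])                  # current depth (same for all nodes)
    capacity = len(frontier) * (n - d)    # worst-case number of children
    children = []
    for placement in frontier:
        for col in range(n):
            # A column is valid if no earlier queen shares it or a diagonal.
            if all(col != c and abs(col - c) != len(placement) - r
                   for r, c in enumerate(placement)):
                children.append(placement + (col,))
    assert len(children) <= capacity      # the bound always holds
    return children, capacity
```

Iterating this expansion from the empty placement down to depth n enumerates all solutions while only ever needing buffers whose sizes were known before each level was launched.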